This dataset contains the daily recorded stock prices (and more) for the company, starting from the day it went public.
Twitter, originally founded in 2006, has been publicly traded since November 2013, when it held an initial public offering that raised $1.8 billion.
Twitter, Inc. operates as a platform for public self-expression and conversation in real-time. The company offers Twitter, a platform that allows users to consume, create, distribute, and discover content.
The company is now in the midst of a takeover battle with Elon Musk, who has bid more than $46 billion to buy it, and stock prices have soared since.
## Date Open High Low Close Adj.Close Volume
## 1 2013-11-07 45.10 50.09 44.00 44.90 44.90 117701600
## 2 2013-11-08 45.93 46.94 40.69 41.65 41.65 27925300
## 3 2013-11-11 40.50 43.00 39.40 42.90 42.90 16113900
## 4 2013-11-12 43.66 43.78 41.83 41.90 41.90 6316700
## 5 2013-11-13 41.03 42.87 40.76 42.60 42.60 8688300
## 6 2013-11-14 42.34 45.67 42.24 44.69 44.69 11099400
The objective of this project is to analyze the stock and predict whether it will gain or lose.
The dataset contains 7 variables and 1,758 rows. We need to remove all NA values before we start the analysis.
The 2013 data makes up only a small part of our sample, so we decided to remove it.
We group the data by year and month and take the monthly means for analysis, and we also rescale volume to units of 1 million shares so that it is on a scale comparable to the other variables.
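The report's grouping step was done in R; a minimal Python sketch of the same idea (group daily rows by year and month, average each column, rescale volume to millions) looks like the following. The sample rows here are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical daily rows: (date "YYYY-MM-DD", close, volume in shares).
# These three rows are illustrative only, not values from the dataset.
daily = [
    ("2014-01-02", 67.50, 10_971_700),
    ("2014-01-03", 69.00, 12_800_000),
    ("2014-02-03", 65.25,  9_500_000),
]

# Group by (year, month); volume is rescaled to millions so it sits on
# a scale comparable to the price columns, as described above.
groups = defaultdict(list)
for date, close, volume in daily:
    year, month = date[:4], date[5:7]
    groups[(year, month)].append((close, volume / 1e6))

# Collapse each group to its column-wise mean.
monthly = {
    key: tuple(sum(col) / len(col) for col in zip(*rows))
    for key, rows in groups.items()
}
print(monthly[("2014", "01")])  # (mean close, mean volume in millions)
```

The same transformation in the report is presumably a `group_by`/`summarise` pipeline in R; the sketch only illustrates the shape of the computation.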
## integer(0)
Volume has many outliers, but we know that if a large institution enters the stock market, the price will change drastically, and at the same time, the volume will suddenly rise.
The figure shows a roughly normal but left-skewed distribution, so this data is valuable for us to analyze.
This step makes it easier to understand the relationships between those variables.
As shown in the boxplot and dot plot above, all variables follow a left-skewed distribution. Below are histograms of the sample means of 5,000 random samples of sizes 10, 20, 30, and 40; as the central limit theorem predicts, the sample means approximately follow a normal distribution.
## Sample Size = 10 Mean = 31.36889 SD = 3.548034
## Sample Size = 20 Mean = 31.22513 SD = 2.55856
## Sample Size = 30 Mean = 31.26072 SD = 2.075447
## Sample Size = 40 Mean = 31.27649 SD = 1.728638
Random forest is an algorithm that combines multiple trees through the idea of ensemble learning. Its basic unit is the decision tree, and it belongs to a major branch of machine learning: ensemble methods. In essence, each tree is grown from a random sample of observations (rows) and a random subset of variables (columns).
We can summarize this as: Random Forest = Bagging + Decision Trees.
Bagging: bootstrap aggregating.
The essence is to resample many times and let the learning algorithm train for multiple rounds. The training set for each round consists of n training samples drawn at random with replacement from the initial training set, so a given sample may appear several times in one round's training set or not at all.
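The resampling scheme just described can be sketched in a few lines of Python. The base learner here is simply the sample mean, a stand-in for a decision tree, since the point of the sketch is the bootstrap-and-aggregate structure, not the model:

```python
import random
import statistics

random.seed(0)

def bootstrap_rounds(training_set, rounds):
    """Draw `rounds` bootstrap samples: each is n draws with replacement,
    so a given observation may appear several times or not at all."""
    n = len(training_set)
    return [random.choices(training_set, k=n) for _ in range(rounds)]

# Closing prices taken from the head() of the dataset shown above.
data = [41.65, 42.90, 41.90, 42.60, 44.69]
samples = bootstrap_rounds(data, rounds=3)

# Each round would train one base learner; here the "learner" is the mean,
# and the aggregating step averages the per-round predictions.
predictions = [statistics.mean(s) for s in samples]
print(statistics.mean(predictions))
```

In a real random forest each bootstrap sample grows a full decision tree, and at every split only a random subset of the variables is considered, which is exactly what the `mtry` parameter below controls.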
mtry: the number of variables randomly sampled as split candidates in each individual decision tree.
ntree: the number of trees in the random forest.
Generally, mtry is selected by trying values one by one until an ideal value is found; for ntree, a suitable value can be read off a plot of model error, choosing the point where the error stabilizes.
Through testing, we know that the error value is smallest when mtry equals 4, so we choose 4.
## [1] 0.3346717
## [1] 0.3114712
## [1] 0.2975701
## [1] 0.2729026
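The four error values printed above correspond to mtry = 1 through 4 (the pairing is inferred from the order they were printed), and choosing the best mtry is just an argmin over them:

```python
# OOB error for each candidate mtry, copied from the output above;
# the mapping mtry -> error is inferred from the print order.
oob_error = {
    1: 0.3346717,
    2: 0.3114712,
    3: 0.2975701,
    4: 0.2729026,
}

best_mtry = min(oob_error, key=oob_error.get)
print(best_mtry)  # → 4
```

Since there are only 4 predictors (Open, High, Low, Volume), mtry = 4 means every variable is considered at every split.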
##
## all 0 1
## Open 0.1678 0.6779 0.2424
## High 0.1543 -0.0536 0.8493
## Low 0.1486 0.6306 0.1863
## Volume 0.0868 0.5461 -0.0553
## Sample size: 1082
## Frequency of class labels: 520, 562
## Number of trees: 400
## Forest terminal node size: 1
## Average no. of terminal nodes: 170.4675
## No. of variables tried at each split: 4
## Total no. of variables: 4
## Resampling used to grow trees: swor
## Resample size used to grow trees: 684
## Analysis: RF-C
## Family: class
## Splitting rule: gini
## Imbalanced ratio: 1.0808
## (OOB) Brier score: 0.18857893
## (OOB) Normalized Brier score: 0.75431572
## (OOB) AUC: 0.78548796
## (OOB) PR-AUC: 0.77186422
## (OOB) G-mean: 0.7176741
## (OOB) Requested performance error: 0.28096118, 0.31923077, 0.2455516
##
## Confusion matrix:
##
## predicted
## observed 0 1 class.error
## 0 354 166 0.3192
## 1 138 424 0.2456
##
## (OOB) Misclassification rate: 0.2809612
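The OOB summary numbers are internally consistent: the per-class errors and the overall misclassification rate can all be re-derived from the confusion matrix above (values copied from the output):

```python
# OOB confusion matrix from the summary above: class -> (predicted 0, predicted 1).
oob_matrix = {0: (354, 166), 1: (138, 424)}

# Per-class error: off-diagonal count divided by the class total.
err_0 = oob_matrix[0][1] / sum(oob_matrix[0])   # 166 / 520
err_1 = oob_matrix[1][0] / sum(oob_matrix[1])   # 138 / 562

# Overall misclassification: all off-diagonal counts over all 1082 OOB cases.
misclassified = oob_matrix[0][1] + oob_matrix[1][0]
total = sum(sum(row) for row in oob_matrix.values())
oob_rate = misclassified / total                 # 304 / 1082

print(round(err_0, 4), round(err_1, 4), round(oob_rate, 7))
```

These reproduce the class.error column (0.3192, 0.2456) and the reported OOB misclassification rate of 0.2809612.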
## Predicted
## Actual 0 1
## 0 226 112
## 1 77 223
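From the test-set confusion matrix above, overall accuracy and per-class recall follow directly; a quick Python check (matrix values copied from the output):

```python
# Test-set confusion matrix from above: actual class -> (predicted 0, predicted 1).
test_matrix = {0: (226, 112), 1: (77, 223)}

correct = test_matrix[0][0] + test_matrix[1][1]            # 226 + 223
total = sum(sum(row) for row in test_matrix.values())      # 638 test cases
accuracy = correct / total

recall_0 = test_matrix[0][0] / sum(test_matrix[0])         # 226 / 338
recall_1 = test_matrix[1][1] / sum(test_matrix[1])         # 223 / 300

print(f"accuracy = {accuracy:.4f}, recall(0) = {recall_0:.4f}, recall(1) = {recall_1:.4f}")
```

So the model classifies roughly 70% of the held-out days correctly, in line with the ~0.28 OOB misclassification rate reported during training.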